First, let us look at the data:

str(houses)
## 'data.frame':    1460 obs. of  81 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Alley        : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
##  $ LotShape     : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
##  $ LandContour  : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Utilities    : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
##  $ LotConfig    : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
##  $ LandSlope    : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
##  $ Condition1   : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
##  $ Condition2   : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
##  $ BldgType     : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ HouseStyle   : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ RoofMatl     : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior1st  : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
##  $ Exterior2nd  : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
##  $ MasVnrType   : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
##  $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
##  $ ExterCond    : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Foundation   : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
##  $ BsmtQual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 3 4 3 3 1 3 4 4 ...
##  $ BsmtCond     : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 4 2 4 4 4 4 4 4 ...
##  $ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
##  $ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 6 3 ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 6 6 2 6 6 ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ HeatingQC    : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ CentralAir   : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Electrical   : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 5 3 5 NA 3 5 5 5 ...
##  $ GarageType   : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 2 3 2 3 2 2 3 2 ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 2 3 ...
##  $ GarageCond   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ PavedDrive   : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
##  $ Fence        : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA 3 NA NA NA NA ...
##  $ MiscFeature  : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA NA 3 NA 3 NA NA ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
dim(houses)
## [1] 1460   81
head(houses)
##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1  1         60       RL          65    8450   Pave  <NA>      Reg         Lvl
## 2  2         20       RL          80    9600   Pave  <NA>      Reg         Lvl
## 3  3         60       RL          68   11250   Pave  <NA>      IR1         Lvl
## 4  4         70       RL          60    9550   Pave  <NA>      IR1         Lvl
## 5  5         60       RL          84   14260   Pave  <NA>      IR1         Lvl
## 6  6         50       RL          85   14115   Pave  <NA>      IR1         Lvl
##   Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 2    AllPub       FR2       Gtl      Veenker      Feedr       Norm     1Fam
## 3    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 4    AllPub    Corner       Gtl      Crawfor       Norm       Norm     1Fam
## 5    AllPub       FR2       Gtl      NoRidge       Norm       Norm     1Fam
## 6    AllPub    Inside       Gtl      Mitchel       Norm       Norm     1Fam
##   HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1     2Story           7           5      2003         2003     Gable  CompShg
## 2     1Story           6           8      1976         1976     Gable  CompShg
## 3     2Story           7           5      2001         2002     Gable  CompShg
## 4     2Story           7           5      1915         1970     Gable  CompShg
## 5     2Story           8           5      2000         2000     Gable  CompShg
## 6     1.5Fin           5           5      1993         1995     Gable  CompShg
##   Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1     VinylSd     VinylSd    BrkFace        196        Gd        TA      PConc
## 2     MetalSd     MetalSd       None          0        TA        TA     CBlock
## 3     VinylSd     VinylSd    BrkFace        162        Gd        TA      PConc
## 4     Wd Sdng     Wd Shng       None          0        TA        TA     BrkTil
## 5     VinylSd     VinylSd    BrkFace        350        Gd        TA      PConc
## 6     VinylSd     VinylSd       None          0        TA        TA       Wood
##   BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1       Gd       TA           No          GLQ        706          Unf
## 2       Gd       TA           Gd          ALQ        978          Unf
## 3       Gd       TA           Mn          GLQ        486          Unf
## 4       TA       Gd           No          ALQ        216          Unf
## 5       Gd       TA           Av          GLQ        655          Unf
## 6       Gd       TA           No          GLQ        732          Unf
##   BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1          0       150         856    GasA        Ex          Y      SBrkr
## 2          0       284        1262    GasA        Ex          Y      SBrkr
## 3          0       434         920    GasA        Ex          Y      SBrkr
## 4          0       540         756    GasA        Gd          Y      SBrkr
## 5          0       490        1145    GasA        Ex          Y      SBrkr
## 6          0        64         796    GasA        Ex          Y      SBrkr
##   X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1       856       854            0      1710            1            0        2
## 2      1262         0            0      1262            0            1        2
## 3       920       866            0      1786            1            0        2
## 4       961       756            0      1717            1            0        1
## 5      1145      1053            0      2198            1            0        2
## 6       796       566            0      1362            1            0        1
##   HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1        1            3            1          Gd            8        Typ
## 2        0            3            1          TA            6        Typ
## 3        1            3            1          Gd            6        Typ
## 4        0            3            1          Gd            7        Typ
## 5        1            4            1          Gd            9        Typ
## 6        1            1            1          TA            5        Typ
##   Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1          0        <NA>     Attchd        2003          RFn          2
## 2          1          TA     Attchd        1976          RFn          2
## 3          1          TA     Attchd        2001          RFn          2
## 4          1          Gd     Detchd        1998          Unf          3
## 5          1          TA     Attchd        2000          RFn          3
## 6          0        <NA>     Attchd        1993          Unf          2
##   GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1        548         TA         TA          Y          0          61
## 2        460         TA         TA          Y        298           0
## 3        608         TA         TA          Y          0          42
## 4        642         TA         TA          Y          0          35
## 5        836         TA         TA          Y        192          84
## 6        480         TA         TA          Y         40          30
##   EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1             0          0           0        0   <NA>  <NA>        <NA>
## 2             0          0           0        0   <NA>  <NA>        <NA>
## 3             0          0           0        0   <NA>  <NA>        <NA>
## 4           272          0           0        0   <NA>  <NA>        <NA>
## 5             0          0           0        0   <NA>  <NA>        <NA>
## 6             0        320           0        0   <NA> MnPrv        Shed
##   MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1       0      2   2008       WD        Normal    208500
## 2       0      5   2007       WD        Normal    181500
## 3       0      9   2008       WD        Normal    223500
## 4       0      2   2006       WD       Abnorml    140000
## 5       0     12   2008       WD        Normal    250000
## 6     700     10   2009       WD        Normal    143000
houses$MSSubClass <- as.factor(houses$MSSubClass)

Let us look at the missing data:

miss <- colSums(is.na(houses))  # number of NAs in each column
miss[miss > 0.5 * nrow(houses)]  # more than 50% of data is NA
##       Alley      PoolQC       Fence MiscFeature 
##        1369        1453        1179        1406

However, in all four variables where NAs make up more than 50% of the observations, NA does not mean a missing value. It encodes the ‘no’/‘none’ category of the variable (no alley access, no pool, no fence, no miscellaneous feature). In the analysis, consider collapsing the remaining categories to create binary variables (yes and no).
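The recoding suggested above could look roughly like the sketch below. It runs on a small toy data frame rather than on `houses`, and the helper `has_feature` and the new column names (`HasAlley`, `HasPool`) are hypothetical names introduced only for illustration.

```r
# Toy stand-in for `houses`; only the NA pattern matters here.
toy <- data.frame(
  Alley  = factor(c("Grvl", NA, "Pave", NA)),
  PoolQC = factor(c(NA, NA, "Ex", NA))
)

# Hypothetical helper: NA is read as "feature absent",
# any recorded level as "feature present".
has_feature <- function(x) {
  factor(ifelse(is.na(x), "no", "yes"), levels = c("no", "yes"))
}

toy$HasAlley <- has_feature(toy$Alley)
toy$HasPool  <- has_feature(toy$PoolQC)

table(toy$HasAlley)
```

This reading of NA is only appropriate for variables where NA is known to encode absence; for variables such as LotFrontage it would be misleading.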

The analysis of continuous (numerical) variables

num.vars <- Filter(is.numeric, houses) %>% names()  # 37 numeric columns, including Id
num.vars <- num.vars[-1]  # drop Id (the first column), leaving 36

summary(houses[num.vars])
##   LotFrontage        LotArea        OverallQual      OverallCond   
##  Min.   : 21.00   Min.   :  1300   Min.   : 1.000   Min.   :1.000  
##  1st Qu.: 59.00   1st Qu.:  7554   1st Qu.: 5.000   1st Qu.:5.000  
##  Median : 69.00   Median :  9478   Median : 6.000   Median :5.000  
##  Mean   : 70.05   Mean   : 10517   Mean   : 6.099   Mean   :5.575  
##  3rd Qu.: 80.00   3rd Qu.: 11602   3rd Qu.: 7.000   3rd Qu.:6.000  
##  Max.   :313.00   Max.   :215245   Max.   :10.000   Max.   :9.000  
##  NA's   :259                                                       
##    YearBuilt     YearRemodAdd    MasVnrArea       BsmtFinSF1    
##  Min.   :1872   Min.   :1950   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:1954   1st Qu.:1967   1st Qu.:   0.0   1st Qu.:   0.0  
##  Median :1973   Median :1994   Median :   0.0   Median : 383.5  
##  Mean   :1971   Mean   :1985   Mean   : 103.7   Mean   : 443.6  
##  3rd Qu.:2000   3rd Qu.:2004   3rd Qu.: 166.0   3rd Qu.: 712.2  
##  Max.   :2010   Max.   :2010   Max.   :1600.0   Max.   :5644.0  
##                                NA's   :8                        
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF       X1stFlrSF   
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Min.   : 334  
##  1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8   1st Qu.: 882  
##  Median :   0.00   Median : 477.5   Median : 991.5   Median :1087  
##  Mean   :  46.55   Mean   : 567.2   Mean   :1057.4   Mean   :1163  
##  3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2   3rd Qu.:1391  
##  Max.   :1474.00   Max.   :2336.0   Max.   :6110.0   Max.   :4692  
##                                                                    
##    X2ndFlrSF     LowQualFinSF       GrLivArea     BsmtFullBath   
##  Min.   :   0   Min.   :  0.000   Min.   : 334   Min.   :0.0000  
##  1st Qu.:   0   1st Qu.:  0.000   1st Qu.:1130   1st Qu.:0.0000  
##  Median :   0   Median :  0.000   Median :1464   Median :0.0000  
##  Mean   : 347   Mean   :  5.845   Mean   :1515   Mean   :0.4253  
##  3rd Qu.: 728   3rd Qu.:  0.000   3rd Qu.:1777   3rd Qu.:1.0000  
##  Max.   :2065   Max.   :572.000   Max.   :5642   Max.   :3.0000  
##                                                                  
##   BsmtHalfBath        FullBath        HalfBath       BedroomAbvGr  
##  Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :0.00000   Median :2.000   Median :0.0000   Median :3.000  
##  Mean   :0.05753   Mean   :1.565   Mean   :0.3829   Mean   :2.866  
##  3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :2.00000   Max.   :3.000   Max.   :2.0000   Max.   :8.000  
##                                                                    
##   KitchenAbvGr    TotRmsAbvGrd      Fireplaces     GarageYrBlt  
##  Min.   :0.000   Min.   : 2.000   Min.   :0.000   Min.   :1900  
##  1st Qu.:1.000   1st Qu.: 5.000   1st Qu.:0.000   1st Qu.:1961  
##  Median :1.000   Median : 6.000   Median :1.000   Median :1980  
##  Mean   :1.047   Mean   : 6.518   Mean   :0.613   Mean   :1979  
##  3rd Qu.:1.000   3rd Qu.: 7.000   3rd Qu.:1.000   3rd Qu.:2002  
##  Max.   :3.000   Max.   :14.000   Max.   :3.000   Max.   :2010  
##                                                   NA's   :81    
##    GarageCars      GarageArea       WoodDeckSF      OpenPorchSF    
##  Min.   :0.000   Min.   :   0.0   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:1.000   1st Qu.: 334.5   1st Qu.:  0.00   1st Qu.:  0.00  
##  Median :2.000   Median : 480.0   Median :  0.00   Median : 25.00  
##  Mean   :1.767   Mean   : 473.0   Mean   : 94.24   Mean   : 46.66  
##  3rd Qu.:2.000   3rd Qu.: 576.0   3rd Qu.:168.00   3rd Qu.: 68.00  
##  Max.   :4.000   Max.   :1418.0   Max.   :857.00   Max.   :547.00  
##                                                                    
##  EnclosedPorch      X3SsnPorch      ScreenPorch        PoolArea      
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000  
##  Median :  0.00   Median :  0.00   Median :  0.00   Median :  0.000  
##  Mean   : 21.95   Mean   :  3.41   Mean   : 15.06   Mean   :  2.759  
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000  
##  Max.   :552.00   Max.   :508.00   Max.   :480.00   Max.   :738.000  
##                                                                      
##     MiscVal             MoSold           YrSold       SalePrice     
##  Min.   :    0.00   Min.   : 1.000   Min.   :2006   Min.   : 34900  
##  1st Qu.:    0.00   1st Qu.: 5.000   1st Qu.:2007   1st Qu.:129975  
##  Median :    0.00   Median : 6.000   Median :2008   Median :163000  
##  Mean   :   43.49   Mean   : 6.322   Mean   :2008   Mean   :180921  
##  3rd Qu.:    0.00   3rd Qu.: 8.000   3rd Qu.:2009   3rd Qu.:214000  
##  Max.   :15500.00   Max.   :12.000   Max.   :2010   Max.   :755000  
## 

We see that some variables take only a few distinct values (e.g. BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, KitchenAbvGr, Fireplaces, MoSold). In further analysis, these variables may be treated as factors.
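One way to find such variables systematically is to count the distinct values in each numeric column and convert the columns below some cutoff. The sketch below uses a toy data frame, and the cutoff `max.levels` is an arbitrary assumption chosen for this example, not a value taken from the analysis.

```r
# Toy stand-in for the numeric part of `houses`.
toy <- data.frame(
  Fireplaces = c(0L, 1L, 1L, 2L, 0L, 1L),
  GrLivArea  = c(1710L, 1262L, 1786L, 1717L, 2198L, 1362L)
)

# Number of distinct non-missing values per column.
n.distinct <- sapply(toy, function(x) length(unique(x[!is.na(x)])))

# Assumed cutoff for this toy example only.
max.levels <- 3
few.values <- names(n.distinct)[n.distinct <= max.levels]

# Convert the selected columns to factors; the rest stay numeric.
toy[few.values] <- lapply(toy[few.values], factor)
str(toy)
```

On the real data a larger cutoff would be needed, since MoSold, for instance, takes twelve values.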

Let us now look at the correlation structure of numerical variables.

corr <- cor(houses[num.vars]) %>% round(2)
ggcorrplot(corr, type = 'lower')

We see high correlation between variables that are obviously linked, e.g. GarageCars and GarageArea, YearBuilt and YearRemodAdd, etc.

The analysis of categorical variables

fac.vars <- Filter(is.factor, houses) %>% names()  # 44

summary(houses[fac.vars])
##    MSSubClass     MSZoning     Street      Alley      LotShape  LandContour
##  20     :536   C (all):  10   Grvl:   6   Grvl:  50   IR1:484   Bnk:  63   
##  60     :299   FV     :  65   Pave:1454   Pave:  41   IR2: 41   HLS:  50   
##  50     :144   RH     :  16               NA's:1369   IR3: 10   Low:  36   
##  120    : 87   RL     :1151                           Reg:925   Lvl:1311   
##  30     : 69   RM     : 218                                                
##  160    : 63                                                               
##  (Other):262                                                               
##   Utilities      LotConfig    LandSlope   Neighborhood   Condition1  
##  AllPub:1459   Corner : 263   Gtl:1382   NAmes  :225   Norm   :1260  
##  NoSeWa:   1   CulDSac:  94   Mod:  65   CollgCr:150   Feedr  :  81  
##                FR2    :  47   Sev:  13   OldTown:113   Artery :  48  
##                FR3    :   4              Edwards:100   RRAn   :  26  
##                Inside :1052              Somerst: 86   PosN   :  19  
##                                          Gilbert: 79   RRAe   :  11  
##                                          (Other):707   (Other):  15  
##    Condition2     BldgType      HouseStyle    RoofStyle       RoofMatl   
##  Norm   :1445   1Fam  :1220   1Story :726   Flat   :  13   CompShg:1434  
##  Feedr  :   6   2fmCon:  31   2Story :445   Gable  :1141   Tar&Grv:  11  
##  Artery :   2   Duplex:  52   1.5Fin :154   Gambrel:  11   WdShngl:   6  
##  PosN   :   2   Twnhs :  43   SLvl   : 65   Hip    : 286   WdShake:   5  
##  RRNn   :   2   TwnhsE: 114   SFoyer : 37   Mansard:   7   ClyTile:   1  
##  PosA   :   1                 1.5Unf : 14   Shed   :   2   Membran:   1  
##  (Other):   2                 (Other): 19                  (Other):   2  
##   Exterior1st   Exterior2nd    MasVnrType  ExterQual ExterCond  Foundation 
##  VinylSd:515   VinylSd:504   BrkCmn : 15   Ex: 52    Ex:   3   BrkTil:146  
##  HdBoard:222   MetalSd:214   BrkFace:445   Fa: 14    Fa:  28   CBlock:634  
##  MetalSd:220   HdBoard:207   None   :864   Gd:488    Gd: 146   PConc :647  
##  Wd Sdng:206   Wd Sdng:197   Stone  :128   TA:906    Po:   1   Slab  : 24  
##  Plywood:108   Plywood:142   NA's   :  8             TA:1282   Stone :  6  
##  CemntBd: 61   CmentBd: 60                                     Wood  :  3  
##  (Other):128   (Other):136                                                 
##  BsmtQual   BsmtCond    BsmtExposure BsmtFinType1 BsmtFinType2  Heating    
##  Ex  :121   Fa  :  45   Av  :221     ALQ :220     ALQ :  19    Floor:   1  
##  Fa  : 35   Gd  :  65   Gd  :134     BLQ :148     BLQ :  33    GasA :1428  
##  Gd  :618   Po  :   2   Mn  :114     GLQ :418     GLQ :  14    GasW :  18  
##  TA  :649   TA  :1311   No  :953     LwQ : 74     LwQ :  46    Grav :   7  
##  NA's: 37   NA's:  37   NA's: 38     Rec :133     Rec :  54    OthW :   2  
##                                      Unf :430     Unf :1256    Wall :   4  
##                                      NA's: 37     NA's:  38                
##  HeatingQC CentralAir Electrical   KitchenQual Functional  FireplaceQu
##  Ex:741    N:  95     FuseA:  94   Ex:100      Maj1:  14   Ex  : 24   
##  Fa: 49    Y:1365     FuseF:  27   Fa: 39      Maj2:   5   Fa  : 33   
##  Gd:241               FuseP:   3   Gd:586      Min1:  31   Gd  :380   
##  Po:  1               Mix  :   1   TA:735      Min2:  34   Po  : 20   
##  TA:428               SBrkr:1334               Mod :  15   TA  :313   
##                       NA's :   1               Sev :   1   NA's:690   
##                                                Typ :1360              
##    GarageType  GarageFinish GarageQual  GarageCond  PavedDrive  PoolQC    
##  2Types :  6   Fin :352     Ex  :   3   Ex  :   2   N:  90     Ex  :   2  
##  Attchd :870   RFn :422     Fa  :  48   Fa  :  35   P:  30     Fa  :   2  
##  Basment: 19   Unf :605     Gd  :  14   Gd  :   9   Y:1340     Gd  :   3  
##  BuiltIn: 88   NA's: 81     Po  :   3   Po  :   7              NA's:1453  
##  CarPort:  9                TA  :1311   TA  :1326                         
##  Detchd :387                NA's:  81   NA's:  81                         
##  NA's   : 81                                                              
##    Fence      MiscFeature    SaleType    SaleCondition 
##  GdPrv:  59   Gar2:   2   WD     :1267   Abnorml: 101  
##  GdWo :  54   Othr:   2   New    : 122   AdjLand:   4  
##  MnPrv: 157   Shed:  49   COD    :  43   Alloca :  12  
##  MnWw :  11   TenC:   1   ConLD  :   9   Family :  20  
##  NA's :1179   NA's:1406   ConLI  :   5   Normal :1198  
##                           ConLw  :   5   Partial: 125  
##                           (Other):   9

The variable Utilities carries almost no information: it has only two categories, and one of them (NoSeWa) contains a single observation. Condition2 may also be of little use, since 1445 of the 1460 observations fall into the Norm category and the remaining categories are very small. In further analysis, consider merging categories for some variables, as several of them (e.g. the quality ratings) have categories with only a few observations.
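Merging sparse categories can be done in base R by renaming all rare levels to a common label. The sketch below is a minimal illustration on a toy factor; the threshold `min.count` and the label "Other" are assumptions made for the example.

```r
# Toy factor with one dominant level and several rare ones.
x <- factor(c("Norm", "Norm", "Norm", "Norm", "Feedr", "Artery", "Norm", "PosN"))

# Assumed threshold: levels with fewer observations are merged.
min.count <- 2
counts <- table(x)
rare <- names(counts)[counts < min.count]

# Assigning the same name to several levels merges them into one.
merged <- x
levels(merged)[levels(merged) %in% rare] <- "Other"

table(merged)
```

Whether a pooled "Other" category is meaningful depends on the variable; for ordered ratings it may be more natural to merge neighbouring levels instead.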

Let us look more closely at the target variable SalePrice.

summary(houses$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000

This is the histogram of SalePrice together with an estimate of its density:

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
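The plotting code is not shown in this report. A comparable figure can be sketched with base graphics as below; this is a stand-in, not the original code (the `stat_bin()` message suggests the original used ggplot2), and the prices here are synthetic values generated only so the example runs on its own.

```r
set.seed(1)

# Synthetic right-skewed "prices"; not the real SalePrice values.
price <- rlnorm(500, meanlog = 12, sdlog = 0.4)

# Histogram on the density scale so the density curve is comparable.
hist(price, breaks = 30, freq = FALSE,
     main = "Histogram and density estimate", xlab = "Price")

# Kernel density estimate drawn over the histogram.
dens <- density(price)
lines(dens, lwd = 2)
```

With the real data, `price` would be replaced by `houses$SalePrice`.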

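We see that the distribution is right-skewed: the mean (180921) lies above the median (163000), and the maximum (755000) is far above the third quartile (214000).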
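The skewness can also be quantified. The sketch below defines a simple moment-based sample skewness (the function name `skewness` is defined here and is not taken from a package) and applies it to two toy vectors.

```r
# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 3/2.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

right.tail <- skewness(c(1, 2, 3, 4, 20))  # positive: long right tail
symmetric  <- skewness(c(1, 2, 3, 4, 5))   # zero: symmetric values
```

A positive value for `skewness(houses$SalePrice)` would agree with the visual impression from the histogram.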

Next, we draw boxplots of the sale price by category for each categorical variable.
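The code for these plots is not shown. A minimal base-graphics sketch of the idea, looping over all factor columns, might look like this; the toy data frame and the name `fac.toy` are assumptions for illustration, and on the real data the loop would run over `fac.vars`.

```r
# Toy stand-in for `houses` with two categorical variables.
toy <- data.frame(
  SalePrice  = c(208500, 181500, 223500, 140000, 250000, 143000),
  Street     = factor(c("Pave", "Pave", "Grvl", "Pave", "Grvl", "Pave")),
  CentralAir = factor(c("Y", "Y", "N", "Y", "Y", "N"))
)

# Names of all factor columns.
fac.toy <- names(Filter(is.factor, toy))

# One boxplot of sale price per categorical variable.
for (v in fac.toy) {
  boxplot(toy$SalePrice ~ toy[[v]],
          xlab = v, ylab = "SalePrice", main = v)
}
```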

Now, we look at the correlation of the sale price with the remaining numerical variables.

(cor.sp <- corr[, "SalePrice"])
##   LotFrontage       LotArea   OverallQual   OverallCond     YearBuilt 
##            NA          0.26          0.79         -0.08          0.52 
##  YearRemodAdd    MasVnrArea    BsmtFinSF1    BsmtFinSF2     BsmtUnfSF 
##          0.51            NA          0.39         -0.01          0.21 
##   TotalBsmtSF     X1stFlrSF     X2ndFlrSF  LowQualFinSF     GrLivArea 
##          0.61          0.61          0.32         -0.03          0.71 
##  BsmtFullBath  BsmtHalfBath      FullBath      HalfBath  BedroomAbvGr 
##          0.23         -0.02          0.56          0.28          0.17 
##  KitchenAbvGr  TotRmsAbvGrd    Fireplaces   GarageYrBlt    GarageCars 
##         -0.14          0.53          0.47            NA          0.64 
##    GarageArea    WoodDeckSF   OpenPorchSF EnclosedPorch    X3SsnPorch 
##          0.62          0.32          0.32         -0.13          0.04 
##   ScreenPorch      PoolArea       MiscVal        MoSold        YrSold 
##          0.11          0.09         -0.02          0.05         -0.03 
##     SalePrice 
##          1.00
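The NA entries for LotFrontage, MasVnrArea and GarageYrBlt arise because these columns contain missing values and `cor()` by default returns NA for any pair involving them. The sketch below shows the effect on a toy data frame and one possible alternative, `use = "pairwise.complete.obs"`, which computes each correlation from the rows where both variables are observed. Whether that is appropriate here is a modelling decision that this report does not make.

```r
# Toy data with one missing LotFrontage value.
toy <- data.frame(
  LotFrontage = c(65, 80, 68, NA, 84, 85),
  SalePrice   = c(208500, 181500, 223500, 140000, 250000, 143000)
)

# Default behaviour: a single NA makes the correlation NA.
r.default <- cor(toy)["LotFrontage", "SalePrice"]

# Alternative: use only rows where both variables are present.
r.pairwise <- cor(toy, use = "pairwise.complete.obs")["LotFrontage", "SalePrice"]
```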

These are the variables whose correlation with the sale price is higher than 0.5 (SalePrice itself appears trivially, with correlation 1).

cor.sp.high
##  OverallQual    YearBuilt YearRemodAdd  TotalBsmtSF    X1stFlrSF    GrLivArea 
##         0.79         0.52         0.51         0.61         0.61         0.71 
##     FullBath TotRmsAbvGrd   GarageCars   GarageArea    SalePrice 
##         0.56         0.53         0.64         0.62         1.00
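The code that builds `cor.sp.high` is not shown. One way such a selection could be made is sketched below on a shortened copy of the correlations printed above; the names `cor.toy` and `cor.toy.high` are hypothetical, and the NA check is needed because comparing NA with 0.5 yields NA rather than FALSE.

```r
# A few of the correlations with SalePrice, copied from the output above.
cor.toy <- c(LotFrontage = NA, LotArea = 0.26, OverallQual = 0.79,
             GrLivArea = 0.71, SalePrice = 1.00)

# Keep entries that are not NA and exceed 0.5.
cor.toy.high <- cor.toy[!is.na(cor.toy) & cor.toy > 0.5]
cor.toy.high
```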

Let us plot the highly correlated variables against the sale price.

We see increasing trends, in line with the positive correlations above.
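The plotting code is not shown. A minimal base-graphics sketch of such scatter plots follows, using the first six rows printed by `head(houses)` for two of the highly correlated variables; the name `high.vars` is an assumption, and on the real data it would hold the names in `cor.sp.high` other than SalePrice.

```r
# First six observations, copied from the head(houses) output.
toy <- data.frame(
  OverallQual = c(7, 6, 7, 7, 8, 5),
  GrLivArea   = c(1710, 1262, 1786, 1717, 2198, 1362),
  SalePrice   = c(208500, 181500, 223500, 140000, 250000, 143000)
)

# Predictors to plot against the target.
high.vars <- setdiff(names(toy), "SalePrice")

# One panel per predictor; restore graphics settings afterwards.
op <- par(mfrow = c(1, length(high.vars)))
for (v in high.vars) {
  plot(toy[[v]], toy$SalePrice, xlab = v, ylab = "SalePrice")
}
par(op)
```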